Speeding Up q-Gram Mining on Grammar-Based Compressed Texts

Authors

  • Keisuke Goto
  • Hideo Bannai
  • Shunsuke Inenaga
  • Masayuki Takeda
Abstract

We present an efficient algorithm for calculating q-gram frequencies on strings represented in compressed form, namely, as a straight line program (SLP). Given an SLP 𝒯 of size n that represents a string T, the algorithm computes the occurrence frequencies of all q-grams in T by reducing the problem to the weighted q-gram frequencies problem on a trie-like structure of size m = |T| − dup(q, 𝒯), where dup(q, 𝒯) is a quantity that represents the amount of redundancy that the SLP captures with respect to q-grams. The reduced problem can be solved in linear time. Since m = O(qn), the running time of our algorithm is O(min{|T| − dup(q, 𝒯), qn}), improving our previous O(qn) algorithm when q = Ω(|T|/n).
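To make the problem statement concrete, the following is a minimal sketch of q-gram counting directly on an SLP, in the spirit of the earlier O(qn) approach rather than the trie-based reduction described in the abstract. It rests on one observation: every q-gram occurrence in T crosses the boundary of exactly one variable X = YZ in the derivation tree, so it is enough to scan the short string formed by the last q − 1 characters of Y and the first q − 1 characters of Z, and weight each q-gram found there by the number of times X occurs in the derivation tree. The function name `qgram_frequencies_slp` and the rule encoding (a list in which each entry is either a single character or a pair of indices of earlier entries, with the last entry as the start symbol) are assumptions made for illustration and do not come from the paper.

```python
from collections import Counter


def qgram_frequencies_slp(rules, q):
    """Count q-grams of the string derived by an SLP without expanding it.

    Hypothetical encoding: rules[i] is a one-character string (terminal rule)
    or a pair (l, r) of indices l, r < i; rules[-1] is the start symbol.
    """
    n = len(rules)
    k = q - 1

    # Prefix and suffix of each variable's expansion, truncated to k characters.
    pre = [""] * n
    suf = [""] * n
    for i, rule in enumerate(rules):
        if isinstance(rule, str):
            pre[i] = suf[i] = rule[:k]
        else:
            l, r = rule
            pre[i] = (pre[l] + pre[r])[:k]
            suf[i] = (suf[l] + suf[r])[-k:] if k > 0 else ""

    # vocc[i]: number of occurrences of variable i in the derivation tree.
    vocc = [0] * n
    vocc[-1] = 1
    for i in range(n - 1, -1, -1):
        if not isinstance(rules[i], str):
            l, r = rules[i]
            vocc[l] += vocc[i]
            vocc[r] += vocc[i]

    # Count the q-grams that cross each variable's boundary, weighted by vocc.
    freq = Counter()
    for i, rule in enumerate(rules):
        if vocc[i] == 0:
            continue  # variable not reachable from the start symbol
        if isinstance(rule, str):
            t = rule  # contributes a q-gram only when q == 1
        else:
            l, r = rule
            t = suf[l] + pre[r]  # at most 2(q - 1) characters
        for j in range(len(t) - q + 1):
            freq[t[j:j + q]] += vocc[i]
    return freq
```

As a usage example, the rules `["a", "b", (0, 1), (2, 0), (3, 2)]` derive the string `abaab`, and calling `qgram_frequencies_slp(rules, 2)` counts `ab` twice and `ba` and `aa` once each. Each variable contributes a string of at most 2(q − 1) characters, which is where the O(qn) bound on the amount of scanned text comes from in this sketch.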

Related resources

Algorithms and data structures for grammar-compressed strings

This thesis presents new algorithms and data structures for handling data represented as grammar-compressed strings. The compression scheme we focus on is the Straight Line Program (SLP). In the following, 𝒮 is an SLP of size n compressing a string S of size N. We consider the following problems. The q-gram profile of a compressed string. We present an algorithm for computing the q-gram profil...

Data Structures for Grammar-compressed Strings

This thesis presents new algorithms and data structures for handling data represented as grammar-compressed strings. The compression scheme we focus on is the Straight Line Program (SLP). In the following, 𝒮 is an SLP of size n compressing a string S of size N. We consider the following problems. The q-gram profile of a compressed string. We present an algorithm for computing the q-gram profil...

Computing q-Gram Non-overlapping Frequencies on SLP Compressed Texts

Length-q substrings, or q-grams, can represent important characteristics of text data, and determining the frequencies of all q-grams contained in the data is an important problem with many applications in the field of data mining and machine learning. In this paper, we consider the problem of calculating the non-overlapping frequencies of all q-grams in a text given in compressed form, namely, ...

Compact q-Gram Profiling of Compressed Strings

We consider the problem of computing the q-gram profile of a string T of size N compressed by a context-free grammar with n production rules. We present an algorithm that runs in O(N − α) expected time and uses O(n + k_{T,q}) space, where N − α ≤ qn is the exact number of characters decompressed by the algorithm and k_{T,q} ≤ N − α is the number of distinct q-grams in T. This simultaneously matches the curr...

Fast q-gram Mining on SLP Compressed Strings

We present simple and efficient algorithms for calculating q-gram frequencies on strings represented in compressed form, namely, as a straight line program (SLP). Given an SLP of size n that represents string T, we present an O(qn) time and space algorithm that computes the occurrence frequencies of all q-grams in T. Computational experiments show that our algorithm and its variation are pract...

Publication date: 2012